navigation goal
- North America > Canada > Ontario > Toronto (0.14)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- Asia > Singapore (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- Information Technology (0.70)
- Transportation > Ground > Road (0.70)
Let Humanoids Hike! Integrative Skill Development on Complex Trails
Hiking on complex trails demands balance, agility, and adaptive decision-making over unpredictable terrain. Current humanoid research remains fragmented and inadequate for hiking: locomotion focuses on motor skills without long-term goals or situational awareness, while semantic navigation overlooks real-world embodiment and local terrain variability. We propose training humanoids to hike on complex trails, driving integrative skill development across visual perception, decision making, and motor execution. We develop a learning framework, LEGO-H, that enables a vision-equipped humanoid robot to hike complex trails autonomously. We introduce two technical innovations: 1) A temporal vision transformer variant, tailored to a Hierarchical Reinforcement Learning framework, anticipates future local goals to guide movement, seamlessly integrating locomotion with goal-directed navigation. 2) Latent representations of joint movement patterns, combined with hierarchical metric learning that enhances the Privileged Learning scheme, enable smooth policy transfer from privileged training to onboard execution. These components allow LEGO-H to handle diverse physical and environmental challenges without relying on predefined motion patterns. Experiments across varied simulated trails and robot morphologies highlight LEGO-H's versatility and robustness, positioning hiking as a compelling testbed for embodied autonomy and LEGO-H as a baseline for future humanoid development.
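The hierarchical control structure described above can be sketched as a two-rate loop: a high-level module (the temporal vision transformer in LEGO-H) proposes local goals from recent visual observations, while a low-level policy turns proprioception plus the current goal into joint commands. The class and function names below are illustrative stand-ins, not the actual LEGO-H components.

```python
# Two-rate hierarchical control loop sketch (hypothetical interfaces; the real
# LEGO-H modules are learned networks, these are placeholder stubs).
import numpy as np

class HighLevelGoalPredictor:
    """Stand-in for the temporal vision transformer: maps a short history of
    visual features to the next local navigation goal (x, y in the robot frame)."""
    def predict(self, visual_history: np.ndarray) -> np.ndarray:
        # Placeholder heuristic: average the feature history and take two values.
        return visual_history.mean(axis=0)[:2]

class LowLevelLocomotionPolicy:
    """Stand-in for the learned motor policy: maps proprioception plus the
    current local goal to joint position targets."""
    def __init__(self, num_joints: int = 12):
        self.num_joints = num_joints

    def act(self, proprioception: np.ndarray, local_goal: np.ndarray) -> np.ndarray:
        # Placeholder output; a trained policy would produce gait commands here.
        return np.zeros(self.num_joints)

def control_loop(steps: int = 100, goal_horizon: int = 10) -> np.ndarray:
    high, low = HighLevelGoalPredictor(), LowLevelLocomotionPolicy()
    visual_history = np.zeros((4, 64))       # last 4 frames of visual features
    local_goal = np.zeros(2)
    joint_targets = np.zeros(low.num_joints)
    for t in range(steps):
        if t % goal_horizon == 0:            # re-plan the local goal at a slower rate
            local_goal = high.predict(visual_history)
        proprioception = np.zeros(32)        # joint angles, velocities, IMU, ...
        joint_targets = low.act(proprioception, local_goal)
        # joint_targets would be sent to the simulator / robot here
    return joint_targets

if __name__ == "__main__":
    control_loop()
```

The key design point is the slower re-planning rate of the high-level module, which lets locomotion stay reactive while navigation stays goal-directed.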
- North America > United States > Michigan (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
Zhao, Baining, Fang, Jianjie, Dai, Zichao, Wang, Ziyou, Zha, Jirong, Zhang, Weichen, Gao, Chen, Wang, Yue, Cui, Jinqiang, Chen, Xinlei, Li, Yong
Large multimodal models exhibit remarkable intelligence, yet their embodied cognitive abilities during motion in open-ended urban 3D space remain to be explored. We introduce a benchmark to evaluate whether video-large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. We manually controlled drones to collect 3D embodied motion video data from real-world cities and simulated environments, resulting in 1.5k video clips, and then designed a pipeline to generate 5.2k multiple-choice questions. Evaluations of 17 widely-used Video-LLMs reveal current limitations in urban embodied cognition. Correlation analysis provides insight into the relationships between different tasks, showing that causal reasoning has a strong correlation with recall, perception, and navigation, while the abilities for counterfactual and associative reasoning exhibit lower correlation with other tasks. We also validate the potential for Sim-to-Real transfer in urban embodiment through fine-tuning.
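The evaluation protocol implied by the benchmark, multiple-choice questions scored per task, can be captured in a few lines; the question schema and model interface below are assumptions for illustration, not the released UrbanVideo-Bench code.

```python
# Minimal multiple-choice evaluation loop of the kind the benchmark implies
# (the question schema and model interface are assumptions, not the released code).
from collections import defaultdict

def evaluate(questions, answer_fn):
    """questions: dicts with 'task', 'video', 'question', 'options', 'answer'.
    answer_fn: callable(video, question, options) -> chosen option key."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        pred = answer_fn(q["video"], q["question"], q["options"])
        total[q["task"]] += 1
        correct[q["task"]] += int(pred == q["answer"])
    return {task: correct[task] / total[task] for task in total}

# Toy usage with a dummy model that always answers "A".
sample = [
    {"task": "recall", "video": "clip_001.mp4",
     "question": "Which landmark was passed first?",
     "options": {"A": "bridge", "B": "tower"}, "answer": "A"},
]
print(evaluate(sample, lambda video, question, options: "A"))
```

Per-task accuracies computed this way are also what the reported cross-task correlation analysis would operate on.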
- Transportation > Ground > Road (1.00)
- Health & Medicine (0.76)
- Transportation > Infrastructure & Services (0.68)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (0.46)
Enhancing Multi-Robot Semantic Navigation Through Multimodal Chain-of-Thought Score Collaboration
Shen, Zhixuan, Luo, Haonan, Chen, Kexun, Lv, Fengmao, Li, Tianrui
Understanding how humans cooperatively utilize semantic knowledge to explore unfamiliar environments and decide on navigation directions is critical for household service multi-robot systems. Previous methods primarily focused on single-robot centralized planning strategies, which severely limited exploration efficiency. Recent research has considered decentralized planning strategies for multiple robots, assigning separate planning models to each robot, but these approaches often overlook communication costs. In this work, we propose Multimodal Chain-of-Thought Co-Navigation (MCoCoNav), a modular approach that utilizes multimodal Chain-of-Thought to plan collaborative semantic navigation for multiple robots. MCoCoNav combines visual perception with Vision Language Models (VLMs) to evaluate exploration value through probabilistic scoring, thus reducing time costs and achieving stable outputs. Additionally, a global semantic map is used as a communication bridge, minimizing communication overhead while integrating observational results. Guided by scores that reflect exploration trends, robots utilize this map to assess whether to explore new frontier points or revisit history nodes. Experiments on HM3D_v0.2 and MP3D demonstrate the effectiveness of our approach. Our code is available at https://github.com/FrankZxShen/MCoCoNav.git.
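The score-guided decision the abstract describes, whether to explore a new frontier or revisit a history node on the shared global semantic map, can be sketched as below, with a placeholder scoring function standing in for the VLM chain-of-thought scores.

```python
# Score-guided target selection sketch; score_fn is a placeholder for the
# VLM-derived exploration-value scores described in the abstract.
import math

def select_target(robot_xy, frontiers, history_nodes, score_fn, revisit_threshold=0.3):
    """frontiers / history_nodes: (x, y) points on the shared global semantic map.
    Returns ('explore', frontier) or ('revisit', history_node)."""
    scored = [(score_fn(p), p) for p in frontiers]
    if scored:
        best_score, best_frontier = max(scored, key=lambda sp: sp[0])
        if best_score >= revisit_threshold:
            return "explore", best_frontier
    # Every new frontier looks unpromising: fall back to the nearest history node.
    nearest = min(history_nodes, key=lambda p: math.dist(robot_xy, p))
    return "revisit", nearest

# Toy usage: distance-based scores stand in for the VLM scores.
print(select_target((0, 0), [(2, 1), (5, 5)], [(1, 0)],
                    score_fn=lambda p: 1.0 / (1.0 + math.dist((0, 0), p))))
```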
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)
- Information Technology > Artificial Intelligence > Robots > Robot Planning & Action (0.54)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.50)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.46)
Unified Understanding of Environment, Task, and Human for Human-Robot Interaction in Real-World Environments
Yano, Yuga, Mizutani, Akinobu, Fukuda, Yukiya, Kanaoka, Daiju, Ono, Tomohiro, Tamukoh, Hakaru
To facilitate human-robot interaction (HRI) tasks in real-world scenarios, service robots must adapt to dynamic environments and understand the required tasks while effectively communicating with humans. To accomplish HRI in practice, we propose a novel indoor dynamic map, task understanding system, and response generation system. The indoor dynamic map optimizes robot behavior by managing an occupancy grid map and dynamic information, such as furniture and humans, in separate layers. The task understanding system targets tasks that require multiple actions, such as serving ordered items. Task representations that predefine the flow of necessary actions are applied to achieve highly accurate understanding. The response generation system is executed in parallel with task understanding to facilitate smooth HRI by informing humans of the subsequent actions of the robot. In this study, we focused on waiter duties in a restaurant setting as a representative application of HRI in a dynamic environment. We developed an HRI system that could perform tasks such as serving food and cleaning up while communicating with customers. In experiments conducted in a simulated restaurant environment, the proposed HRI system successfully communicated with customers and served ordered food with 90% accuracy. In a questionnaire administered after the experiment, the HRI system of the robot received 4.2 points out of 5. These outcomes indicated the effectiveness of the proposed method and HRI system in executing waiter tasks in real-world environments.
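The layered indoor dynamic map can be illustrated with a small data structure that keeps the static occupancy grid and the tracked dynamic entities in separate layers; the field names and the simple clearance check are assumptions for the sketch, not the authors' implementation.

```python
# Layered dynamic map sketch: static occupancy grid plus separately tracked
# dynamic entities (layer names and fields are illustrative assumptions).
import numpy as np

class DynamicMap:
    def __init__(self, width, height, resolution=0.05):
        # Static layer: occupancy grid from SLAM (0 = free, 1 = occupied).
        self.occupancy = np.zeros((height, width), dtype=np.uint8)
        # Dynamic layer: movable entities, updated without touching the static layer.
        self.dynamic = {}   # id -> {"kind": "human" or "furniture", "xy": (x, y)}
        self.resolution = resolution

    def update_entity(self, entity_id, kind, xy):
        self.dynamic[entity_id] = {"kind": kind, "xy": xy}

    def is_blocked(self, xy, clearance=0.4):
        gx, gy = int(xy[0] / self.resolution), int(xy[1] / self.resolution)
        if self.occupancy[gy, gx]:
            return True
        return any(np.hypot(xy[0] - e["xy"][0], xy[1] - e["xy"][1]) < clearance
                   for e in self.dynamic.values())

m = DynamicMap(200, 200)
m.update_entity("customer_1", "human", (3.0, 2.5))
print(m.is_blocked((3.1, 2.6)))   # True: a tracked human occupies this area
```

Keeping dynamic information out of the static layer is what lets the robot update a moving customer or a relocated chair without rebuilding the occupancy grid.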
- Asia > Japan > Kyūshū & Okinawa > Kyūshū > Fukuoka Prefecture > Fukuoka (0.04)
- South America > Brazil (0.04)
- Consumer Products & Services > Restaurants (1.00)
- Education > Educational Setting (0.70)
OPEx: A Component-Wise Analysis of LLM-Centric Agents in Embodied Instruction Following
Shi, Haochen, Sun, Zhiyuan, Yuan, Xingdi, Côté, Marc-Alexandre, Liu, Bang
Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions. Recent advancements have seen a surge in employing large language models (LLMs) within a framework-centric approach to enhance performance in embodied learning tasks, including EIF. Despite these efforts, there is no unified understanding of how various components, ranging from visual perception to action execution, affect task performance. To address this gap, we introduce OPEx, a comprehensive framework that delineates the core components essential for solving embodied learning tasks: Observer, Planner, and Executor. Through extensive evaluations, we provide a deep analysis of how each component influences EIF task performance. Furthermore, we innovate within this space by deploying a multi-agent dialogue strategy on a TextWorld counterpart, further enhancing task performance. Our findings reveal that LLM-centric design markedly improves EIF outcomes, identify visual perception and low-level action execution as critical bottlenecks, and demonstrate that augmenting LLMs with a multi-agent framework further elevates performance.
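The Observer, Planner, and Executor decomposition can be made concrete with stub classes wired into a single step function; the interfaces below are assumed for illustration, whereas OPEx instantiates them with learned perception models and an LLM-based planner.

```python
# Stub version of the Observer / Planner / Executor pipeline (interfaces assumed
# for illustration; OPEx uses learned perception and an LLM-based planner).
class Observer:
    """Turns raw egocentric observations into a symbolic scene description."""
    def observe(self, frame):
        return {"visible_objects": [], "agent_pose": (0.0, 0.0, 0.0)}

class Planner:
    """Maps the instruction plus the current scene to the next subgoal."""
    def plan(self, instruction, scene):
        return "find target object" if not scene["visible_objects"] else "pick up target"

class Executor:
    """Grounds a subgoal into low-level environment actions."""
    def execute(self, subgoal):
        return ["RotateRight"] if subgoal == "find target object" else ["PickupObject"]

def step(instruction, frame):
    observer, planner, executor = Observer(), Planner(), Executor()
    scene = observer.observe(frame)
    subgoal = planner.plan(instruction, scene)
    return executor.execute(subgoal)

print(step("Put the mug in the sink", frame=None))   # ['RotateRight']
```

Isolating the three roles this way is also what makes the paper's component-wise analysis possible: each stub can be swapped or ablated independently.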
- North America > United States (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
From Prediction to Planning With Goal Conditioned Lane Graph Traversals
Hallgarten, Marcel, Stoll, Martin, Zell, Andreas
The field of motion prediction for automated driving has seen tremendous progress recently, producing ever more powerful neural network architectures. These powerful models hold great potential for the closely related planning task. In this letter we propose a novel goal-conditioning method and show its potential to transform a state-of-the-art prediction model into a goal-directed planner. Our key insight is that conditioning prediction on a navigation goal at the behaviour level outperforms other widely adopted methods, with the additional benefit of increased model interpretability. We train our model on a large open-source dataset and show promising performance in a comprehensive benchmark.
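Behaviour-level goal conditioning can be pictured as selecting the lane-graph traversal that reaches the route goal and feeding it to the trajectory predictor as an additional input; the breadth-first traversal and the predictor stub below are illustrative, not the paper's model.

```python
# Behaviour-level goal conditioning sketch: find a lane-graph traversal to the
# goal and pass it to the predictor as an extra input (predictor is a stub here).
from collections import deque

def lane_graph_traversal(successors, start_lane, goal_lane):
    """Breadth-first search over lane connectivity; returns lane ids to the goal."""
    queue, parents = deque([start_lane]), {start_lane: None}
    while queue:
        lane = queue.popleft()
        if lane == goal_lane:
            path = []
            while lane is not None:
                path.append(lane)
                lane = parents[lane]
            return path[::-1]
        for nxt in successors.get(lane, []):
            if nxt not in parents:
                parents[nxt] = lane
                queue.append(nxt)
    return None

def plan(predict_fn, scene, successors, start_lane, goal_lane):
    route = lane_graph_traversal(successors, start_lane, goal_lane)
    # Conditioning the prediction model on the intended traversal is what turns
    # a free-form predictor into a goal-directed planner.
    return predict_fn(scene, route)

successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
print(lane_graph_traversal(successors, "A", "D"))   # ['A', 'B', 'D']
```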
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
- Automobiles & Trucks (0.89)
- Transportation > Ground > Road (0.68)
AR Point&Click: An Interface for Setting Robot Navigation Goals
Gu, Morris, Croft, Elizabeth, Cosgun, Akansel
This paper considers the problem of designating navigation goal locations for interactive mobile robots. We investigate a point-and-click interface, implemented with an Augmented Reality (AR) headset. The cameras on the AR headset are used to detect natural pointing gestures performed by the user. The selected goal is visualized through the AR headset, allowing the user to adjust the goal location if desired. We conduct a user study in which participants set consecutive navigation goals for the robot using three different interfaces: AR Point&Click, Person Following, and Tablet (bird's-eye map view). Results show that the proposed AR Point&Click interface improved perceived accuracy and efficiency and reduced mental load compared to the baseline tablet interface, and performed on par with the Person Following method. These results show that AR Point&Click is a feasible interaction model for setting navigation goals.
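The geometric core of such an interface, turning a detected pointing ray into a navigation goal, amounts to a ray-floor intersection; the sketch below assumes the gesture detection and AR visualization are handled elsewhere.

```python
# Pointing-ray to navigation-goal sketch: intersect the detected pointing ray
# with the floor plane z = 0 (gesture detection and AR display handled elsewhere).
import numpy as np

def pointing_ray_to_goal(hand_pos, ray_dir, floor_z=0.0):
    """hand_pos: 3D hand position; ray_dir: pointing direction. Returns (x, y) or None."""
    hand_pos = np.asarray(hand_pos, dtype=float)
    ray_dir = np.asarray(ray_dir, dtype=float)
    if abs(ray_dir[2]) < 1e-6:
        return None                      # ray parallel to the floor: no intersection
    t = (floor_z - hand_pos[2]) / ray_dir[2]
    if t <= 0:
        return None                      # pointing away from the floor
    goal = hand_pos + t * ray_dir
    return goal[:2]                      # navigation goal on the floor plane

print(pointing_ray_to_goal(hand_pos=(0.2, 0.0, 1.4), ray_dir=(0.6, 0.2, -0.77)))
```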
- Oceania > Australia (0.04)
- North America > Canada > British Columbia > Vancouver Island > Capital Regional District > Victoria (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.69)
A Simple Approach for Visual Rearrangement: 3D Mapping and Semantic Search
Trabucco, Brandon, Sigurdsson, Gunnar, Piramuthu, Robinson, Sukhatme, Gaurav S., Salakhutdinov, Ruslan
Physically rearranging objects is an important capability for embodied agents. Visual room rearrangement evaluates an agent's ability to rearrange objects in a room to a desired goal based solely on visual input. We propose a simple yet effective method for this problem: (1) search for and map which objects need to be rearranged, and (2) rearrange each object until the task is complete. Our approach consists of an off-the-shelf semantic segmentation model, a voxel-based semantic map, and a semantic search policy to efficiently find objects that need to be rearranged. On the AI2-THOR Rearrangement Challenge, our method improves on current state-of-the-art end-to-end reinforcement learning methods that learn visual rearrangement policies, raising correct rearrangement from 0.53% to 16.56% while using only 2.7% as many samples from the environment.
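Step (1), deciding which objects still need rearranging, can be approximated by comparing semantic voxel maps of the goal and current room states; the map format and voxel-count threshold below are assumptions for the sketch, not the authors' pipeline.

```python
# Goal-vs-current semantic voxel map comparison (map format and threshold are
# assumptions for the sketch, not the authors' pipeline).
import numpy as np

def objects_to_rearrange(goal_map, current_map, min_changed_voxels=5):
    """goal_map / current_map: (X, Y, Z) integer arrays of semantic class ids
    (0 = empty). Returns class ids whose voxel footprint differs between states."""
    classes = np.unique(np.concatenate([goal_map[goal_map > 0],
                                        current_map[current_map > 0]]))
    changed = []
    for cls in classes:
        mismatch = np.logical_xor(goal_map == cls, current_map == cls).sum()
        if mismatch >= min_changed_voxels:
            changed.append(int(cls))
    return changed

goal = np.zeros((4, 4, 4), dtype=int); goal[0, 0, 0] = 3   # e.g. a mug at one corner
cur = np.zeros((4, 4, 4), dtype=int);  cur[3, 3, 0] = 3    # the mug has been moved
print(objects_to_rearrange(goal, cur, min_changed_voxels=2))   # [3]
```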
- North America > United States > California (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)